Silicon Valley 2014 - Proposal


Embracing Failure: Fault Injection and Service Resilience at Netflix

Abstract:

Complex distributed systems fail. They fail more often and in different ways as they scale and evolve over time. To provide high service availability in the face of this reality, we must embrace failure by designing and testing for it. More importantly, we must induce controlled failures on systems to discover vulnerabilities and resolve them. To this end, Netflix has created a suite of tools, collectively called the Simian Army, to improve the resiliency of our cloud environment. This presentation will cover the motivations for inducing failure in production, the mechanics of how Netflix achieves it, and lessons learned along the way.

This content will be presented at NANOG61 on June 4th, but I don't expect high overlap in audience.

Speaker:

Josh Evans is a 15-year Netflix veteran with a background in e-commerce, video streaming services, distributed systems, tools, and operations. Josh is currently Director of Operations Engineering at Netflix, where he is responsible for enabling continuous improvement of service quality and accelerating engineering velocity. Josh’s team creates and evolves Netflix Open Source tools like Asgard and the Simian Army, which includes the Chaos Monkey.
